Goto

Collaborating Authors

 universal guarantee



Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees

Neural Information Processing Systems

We propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. The predictor is based on a feature space discretization induced by a full-fledged k-d tree with randomly picked directions and a recursive Bayesian distribution, which allows to automatically learn the most relevant feature scales characterizing the conditional distribution. We prove its pointwise universality, i.e., it achieves a normalized log loss performance asymptotically as good as the true conditional entropy of the labels given the features. The time complexity to process the n-th sample point is O(log n) in probability with respect to the distribution generating the data points, whereas other exact nonparametric methods require to process all past observations. Experiments on challenging datasets show the computational and statistical efficiency of our algorithm in comparison to standard and state-of-the-art methods.


Universal guarantees for decision tree induction via a higher-order splitting criterion

Neural Information Processing Systems

We propose a simple extension of {\sl top-down decision tree learning heuristics} such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and {\sl small subsets} of its attributes.



Review for NeurIPS paper: Universal guarantees for decision tree induction via a higher-order splitting criterion

Neural Information Processing Systems

Summary and Contributions: This paper considers the problem of learning decision trees. You are given samples from a function f on the Boolean cube that is known to be computed by a size s decision tree. The goal is to produce a hypothesis h that is also a small decision tree and is close to f. It was known that simply looking at correlations is not a good idea, simple functions like parity of a few variables would defeat this algorithm. Indeed, I don't think there was any known algorithm that was guaranteed to return a "decision tree" of small size. This paper presents an algorithm of this type.


Review for NeurIPS paper: Universal guarantees for decision tree induction via a higher-order splitting criterion

Neural Information Processing Systems

The three reviews agree that the paper develops strong theoretical results regarding an important topic. Also the techniques are interesting, and the paper is well written. The main negative aspect in the reviews concerns the practical applicability of the results. Although the authors address this in their reply, the reviewers after discussion are not really convinced about the potential for bridging the gap between theory and practice. Regardless of this, the reviewers are clear in their assessment that the work deserves publications purely on the strength of the theoretical contribution.


Reviews: Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees

Neural Information Processing Systems

A reader unfamiliar with the Context Tree Weighting technique might have a difficult first read, but as it is a well-known technique in information theory and its applications, I don't think this should be held against it. Context Tree Weighting based variants have been used successfully in many different problems (data compression, bioinformatics), but typically deal with relatively low-dimensional binary side information, so this paper provides a method that fills this gap, and in my mind could be built on further by the machine learning community in the future. Some suggestions for future work: - There is a body of literature which improves on the KT estimator on sequences where the entire alphabet isn't observed, which is a common case when the data is recursively partitioned. I am quite certain the method advocated in this approach could benefit from applying the recently introduced SAD estimator (https://arxiv.org/abs/1305.3671) in place of the multi-KT, and the theoretical results extended to this case.


Universal guarantees for decision tree induction via a higher-order splitting criterion

Neural Information Processing Systems

We propose a simple extension of {\sl top-down decision tree learning heuristics} such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions f: \{-1,1\} n \to \{-1,1\} with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between f and {\sl small subsets} of its attributes. Gini impurity and information gain), in contrast, are based solely on the correlations between f and its {\sl individual} attributes. Our algorithm satisfies the following guarantee: for all target functions f: \{-1,1\} n \to \{-1,1\}, sizes s\in \N, and error parameters \eps, it constructs a decision tree of size s {\tilde{O}((\log s) 2/\eps 2)} that achieves error \le O(\opt_s) \eps, where \opt_s denotes the error of the optimal size- s decision tree for f .


Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees

Neural Information Processing Systems

We propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. The predictor is based on a feature space discretization induced by a full-fledged k-d tree with randomly picked directions and a recursive Bayesian distribution, which allows to automatically learn the most relevant feature scales characterizing the conditional distribution. We prove its pointwise universality, i.e., it achieves a normalized log loss performance asymptotically as good as the true conditional entropy of the labels given the features. The time complexity to process the n-th sample point is O(log n) in probability with respect to the distribution generating the data points, whereas other exact nonparametric methods require to process all past observations. Experiments on challenging datasets show the computational and statistical efficiency of our algorithm in comparison to standard and state-of-the-art methods.


Universal guarantees for decision tree induction via a higher-order splitting criterion

arXiv.org Machine Learning

We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and small subsets of its attributes. The splitting criteria of existing heuristics (e.g. Gini impurity and information gain), in contrast, are based solely on the correlations between $f$ and its individual attributes. Our algorithm satisfies the following guarantee: for all target functions $f : \{-1,1\}^n \to \{-1,1\}$, sizes $s\in \mathbb{N}$, and error parameters $\epsilon$, it constructs a decision tree of size $s^{\tilde{O}((\log s)^2/\epsilon^2)}$ that achieves error $\le O(\mathsf{opt}_s) + \epsilon$, where $\mathsf{opt}_s$ denotes the error of the optimal size $s$ decision tree. A key technical notion that drives our analysis is the noise stability of $f$, a well-studied smoothness measure.